We will, again, use data from Gapminder for the exercises on relational data.
There is an R package named gapminder that contains a selection of the data from Gapminder.
Our aim for the following exercises is to use the individual data files from Gapminder that we have used in some of the previous exercises to create combined datasets that resemble the one contained in the gapminder package.
Before we can begin with these exercises, we need to load the datasets. To do this you can simply use the following chunks (that repeats some parts of the previous exercises).
library(readr)
library(dplyr)
gap_cont <- read_csv("../data/gapminder/countries_continent.csv")
gap_life <- read_csv("../data/gapminder/life_expectancy_years.csv")
gap_pop <- read_csv("../data/gapminder/population_total.csv") %>%
rename(country = "Total population")
gap_gdp <- read_csv("../data/gapminder/gdppercapita_us_inflation_adjusted.csv")
First of all, let’s have a look at the Gapminder data that is included in the gapminder package.
gapminder package and have a look at the dataset it contains. The dataset we want is simply called gapminder.
In the following, we will use different joins to create datasets that contain the same set of variables. We will create two versions of the combined dataset.
Before we do this, however, we want to explore the overlap and discrepancies between the individual datasets. This is somewhat easier to do with the datsets in wide format (as each country name only appears in one row in thise).
anti_join() for this task. To just get the country names, you can select the country variable.
For the following series of joins we want the data in long format again. Just copy, paste, and run the code below to transform the datasets accordingly.
library(tidyr)
gap_life <- gap_life %>%
gather(-country, key = "year", value = "lifeExp") %>%
mutate(year = as.integer(year))
gap_pop <- gap_pop %>%
gather(-country, key = "year", value = "pop") %>%
mutate(year = as.integer(year),
pop = as.integer(pop))
gap_gdp <- gap_gdp %>%
gather(-country, key = "year", value = "gdpPercap") %>%
mutate(year = as.integer(year))
As stated above, we want to create two different versions of the combined datasets: One without missing data andone with as many observations (rows) as possible.
Create the two versions of the combined dataset described above using one type of mutating join for each one.
For the one without missing data you should start with the dataset with the largest number of countries in it (gap_pop), and then join the other datasets in descending order of the number of countries they contain (gap_life, gap_gdp, gap_cont). This datasets also requires to additional (pipe) steps to ensure that a) it contains no NAs and b) its variables are in the same order as in the gapminder package dataset.
The types of mutating joins that you should use are inner_join() and full_join. The variables you should join on are country and year.
Now that we have created the two combined datasets, let’s see how much they differ from each other.
nrow() to get the number of resulting observations for the set operation.
Another way to explore the differences between the datasets could be to look at some of the summary statistics for them. How to create summary statistics with the tidyverse will be the topic of the next session.